Abstract

We present and analyze an agnostic active learning algorithm that works without keeping a version 
space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained 
throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space 
approach, our algorithm sheds the computational burden and brittleness associated with maintaining 
version spaces, yet still allows for substantial improvements over supervised learning for classification. 

1 Introduction 

In active learning, a learner is given access to unlabeled data and is allowed to adaptively choose which ones 
to label. This learning model is motivated by applications in which the cost of labeling data is high relative 
to that of collecting the unlabeled data itself. Therefore, the hope is that the active learner only needs to 
query the labels of a small number of the unlabeled data, and otherwise perform as well as a fully supervised 
learner. In this work, we are interested in agnostic active learning algorithms for binary classification that 
are provably consistent, i.e. that converge to an optimal hypothesis in a given hypothesis class. 

One technique that has proved theoretically profitable is to maintain a candidate set of hypotheses 
(sometimes called a version space), and to query the label of a point only if there is disagreement within this 
set about how to label the point. The criterion for membership in this candidate set needs to be carefully
defined so that an optimal hypothesis is always included, but otherwise this set can be quickly whittled
down as more labels are queried. This technique is perhaps most readily understood in the noise-free setting
[CAL94, Das05], and it can be extended to noisy settings by using empirical confidence bounds [BBL06,
DHM07, BDL09, Han09, Kol09].

The version space approach unfortunately has its share of significant drawbacks. The first is computational
intractability: maintaining a version space and guaranteeing that only hypotheses from this set are
returned is difficult for linear predictors and appears intractable for interesting nonlinear predictors such
as neural nets and decision trees [CAL94]. Another drawback of the approach is its brittleness: a single
mishap (due to, say, modeling failures or computational approximations) might cause the learner to exclude
the best hypothesis from the version space forever; this is an ungraceful failure mode that is not easy to
correct. A third drawback is related to sample re-usability: if (labeled) data is collected using a version
space-based active learning algorithm, and we later decide to use a different algorithm or hypothesis class,
then the earlier data may not be freely re-used because its collection process is inherently biased.

Here, we develop a new strategy addressing all of the above problems given an oracle that returns an 
empirical risk minimizing (ERM) hypothesis. As this oracle matches our abstraction of many supervised 
learning algorithms, we believe active learning algorithms built in this way are immediately and widely 
applicable. 

Our approach instantiates the importance weighted active learning framework of [BDL09] using a rejection
threshold similar to the algorithm of [DHM07], which only accesses hypotheses via a supervised learning
oracle. However, the oracle we require is simpler and avoids strict adherence to a candidate set of hypotheses.
Moreover, our algorithm creates an importance weighted sample that allows for unbiased risk estimation,
even for hypotheses from a class different from the one employed by the active learner. This is in sharp
contrast to many previous algorithms (e.g., [CAL94, BBL06, BBZ07, DHM07, Han09, Kol09]) that create
heavily biased data sets. We prove that our algorithm is always consistent and has an improved label
complexity over passive learning in cases previously studied in the literature. We also describe a practical
instantiation of our algorithm and report on some experimental results.

1.1 Related Work 

As already mentioned, our work is closely related to the previous works of [DHM07] and [BDL09], both of
which in turn draw heavily on the work of [CAL94] and [BBL06]. The algorithm from [DHM07] extends
the selective sampling method of [CAL94] to the agnostic setting using generalization bounds in a manner
similar to that first suggested in [BBL06]. It accesses hypotheses only through a special ERM oracle that
can enforce an arbitrary number of example-based constraints; these constraints define a version space,
and the algorithm only ever returns hypotheses from this space, which can be undesirable as we previously
argued. Other previous algorithms with comparable performance guarantees also require similar
example-based constraints (e.g., [BBL06, BDL09, Han09, Kol09]). Our algorithm differs from these in that (i) it never
restricts its attention to a version space when selecting a hypothesis to return, and (ii) it only requires an
ERM oracle that enforces at most one example-based constraint, and this constraint is only used for selective
sampling. Our label complexity bounds are comparable to those proved in [BDL09] (though somewhat worse
than those in [BBL06, DHM07, Han09, Kol09]).

The use of importance weights to correct for sampling bias is a standard technique for many machine
learning problems (e.g., [SB98, ACBFS02, SKM07]) including active learning [Sug05, Bac06, BDL09]. Our
algorithm is based on the importance weighted active learning (IWAL) framework introduced by [BDL09].
In that work, a rejection threshold procedure called loss-weighting is rigorously analyzed and shown to yield 
improved label complexity bounds in certain cases. Loss-weighting is more general than our technique in 
that it extends beyond zero-one loss to a certain subclass of loss functions such as logistic loss. On the 
other hand, the loss-weighting rejection threshold requires optimizing over a restricted version space, which 
is computationally undesirable. Moreover, the label complexity bound given in [BDL09] only applies to
hypotheses selected from this version space, and not when selected from the entire hypothesis class (as the 
general IWAL framework suggests). We avoid these deficiencies using a new rejection threshold procedure 
and a more subtle martingale analysis. 

Many of the previously mentioned algorithms are analyzed in the agnostic learning model, where no 
assumption is made about the noise distribution (see also [Han07]). In this setting, the label complexity
of active learning algorithms cannot generally improve over supervised learners by more than a constant
factor [Kaa06, BDL09]. However, under a parameterization of the noise distribution related to Tsybakov's
low-noise condition [Tsy04], active learning algorithms have been shown to have improved label complexity
bounds over what is achievable in the purely agnostic setting [CN06, BBZ07, CN07, Han09, Kol09]. We also
consider this parameterization to obtain a tighter label complexity analysis. 

2 Preliminaries 
2.1 Learning Model 

Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input space and
$\mathcal{Y} = \{\pm 1\}$ are the labels. Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a pair of random variables with
joint distribution $\mathcal{D}$. An active learner receives a sequence $(X_1, Y_1), (X_2, Y_2), \ldots$ of i.i.d. copies
of $(X, Y)$, with the label $Y_i$ hidden unless it is explicitly queried. We use the shorthand $a_{1:k}$ to denote
a sequence $(a_1, a_2, \ldots, a_k)$ (so $k = 0$ corresponds to the empty sequence).

Let $\mathcal{H}$ be a set of hypotheses mapping from $\mathcal{X}$ to $\mathcal{Y}$. For simplicity, we assume $\mathcal{H}$ is finite
but does not completely agree on any single $x \in \mathcal{X}$ (i.e., $\forall x \in \mathcal{X}$, $\exists h, h' \in \mathcal{H}$ such
that $h(x) \ne h'(x)$). This keeps the focus on the relevant aspects of active learning that differ from passive
learning. The error of a hypothesis $h : \mathcal{X} \to \mathcal{Y}$ is $\mathrm{err}(h) := \Pr(h(X) \ne Y)$. Let
$h^* := \arg\min\{\mathrm{err}(h) : h \in \mathcal{H}\}$ be a hypothesis of minimum error in $\mathcal{H}$. The goal of the active
learner is to return a hypothesis $h \in \mathcal{H}$ with error $\mathrm{err}(h)$ not much more than $\mathrm{err}(h^*)$, using as
few label queries as possible.

2.2 Importance Weighted Active Learning 

In the importance weighted active learning (IWAL) framework of [BDL09], an active learner looks at the
unlabeled data $X_1, X_2, \ldots$ one at a time. After each new point $X_i$, the learner determines a probability
$P_i \in [0, 1]$. Then a coin with bias $P_i$ is flipped, and the label $Y_i$ is queried if and only if the coin comes
up heads. The query probability $P_i$ can depend on all previous unlabeled examples $X_{1:i-1}$, any previously
queried labels, any past coin flips, and the current unlabeled point $X_i$.

Formally, an IWAL algorithm specifies a rejection threshold function
$p : (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^* \times \mathcal{X} \to [0, 1]$ for determining these query probabilities.
Let $Q_i \in \{0, 1\}$ be a random variable conditionally independent of the current label $Y_i$,

$$Q_i \perp Y_i \mid X_{1:i}, Y_{1:i-1}, Q_{1:i-1},$$

and with conditional expectation

$$\mathbb{E}[Q_i \mid Z_{1:i-1}, X_i] = P_i := p(Z_{1:i-1}, X_i),$$

where $Z_j := (X_j, Y_j, Q_j)$. That is, $Q_i$ indicates if the label $Y_i$ is queried (the outcome of the coin toss).
Although the notation does not explicitly suggest this, the query probability $P_i = p(Z_{1:i-1}, X_i)$ is allowed
to explicitly depend on a label $Y_j$ ($j < i$) if and only if it has been queried ($Q_j = 1$).
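
The protocol above can be sketched in a few lines of Python. This is only an illustration under assumed names: `run_iwal`, `threshold`, and `label_oracle` are hypothetical and not from the paper. The point it shows is that the label oracle is called only when the coin comes up heads, so the threshold function can only ever see labels that were actually queried.

```python
import random


def run_iwal(xs, label_oracle, threshold, rng):
    """Run the IWAL protocol over the unlabeled points xs.

    threshold(history, x) plays the role of the rejection threshold p and
    must return a probability in [0, 1]. history holds one record
    (x_i, y_i, q_i, p_i) per past point, with y_i = None when q_i = 0.
    """
    history = []
    for x in xs:
        p = threshold(history, x)              # P_i = p(Z_{1:i-1}, X_i)
        if not 0.0 <= p <= 1.0:
            raise ValueError("query probability must lie in [0, 1]")
        q = 1 if rng.random() < p else 0       # coin with bias P_i
        y = label_oracle(x) if q == 1 else None  # label stays hidden if tails
        history.append((x, y, q, p))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    points = [rng.random() for _ in range(10)]
    # Hypothetical threshold: query every point with probability 1/2.
    records = run_iwal(points, lambda x: 1 if x >= 0.5 else -1,
                       lambda history, x: 0.5, rng)
    print(sum(q for (_, _, q, _) in records), "labels queried out of", len(records))
```

A constant threshold of 1 recovers passive learning; Algorithm 1 corresponds to a particular data-dependent choice of `threshold`.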

2.3 Importance Weighted Estimators 

We first review some standard facts about the importance weighting technique. For a function
$f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, define the importance weighted estimator of $\mathbb{E}[f(X, Y)]$ from
$Z_{1:n} \in (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^n$ to be

$$\hat{f}(Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot f(X_i, Y_i).$$

Note that this quantity depends on a label $Y_i$ only if it has been queried (i.e., only if $Q_i = 1$; it also depends
on $X_i$ only if $Q_i = 1$). Our rejection threshold will be based on a specialization of this estimator, specifically
the importance weighted empirical error of a hypothesis $h$

$$\mathrm{err}(h, Z_{1:n}) := \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot \mathbb{1}[h(X_i) \ne Y_i].$$

In the notation of Algorithm 1, this is equivalent to

$$\mathrm{err}(h, S_n) := \frac{1}{n} \sum_{(X_i, Y_i, 1/P_i) \in S_n} (1/P_i) \cdot \mathbb{1}[h(X_i) \ne Y_i] \qquad (1)$$

where $S_n \subseteq \mathcal{X} \times \mathcal{Y} \times \mathbb{R}$ is the importance weighted sample collected by the algorithm.
A basic property of these estimators is unbiasedness:

$$\mathbb{E}[\hat{f}(Z_{1:n})] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\mathbb{E}[(Q_i/P_i) \cdot f(X_i, Y_i) \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]\big]
= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[(P_i/P_i) \cdot f(X_i, Y_i)]
= \mathbb{E}[f(X, Y)].$$

So, for example, the importance weighted empirical error of a hypothesis $h$ is an unbiased estimator of its
true error $\mathrm{err}(h)$. This holds for any choice of the rejection threshold that guarantees $P_i > 0$.
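
To make the unbiasedness property concrete, the following sketch computes the importance weighted error and checks, by exact enumeration over all coin-flip outcomes, that its expectation equals the ordinary empirical error on a fixed data set. The function names and the toy data are assumptions made for this illustration, and the query probabilities are held fixed (non-adaptive) purely to keep the enumeration simple.

```python
from itertools import product


def iw_error(h, records):
    """Importance weighted error of h on records [(x, y, q, p), ...].

    The normalization is by the number of unlabeled points seen, not by the
    number of labels queried. The label y is ignored whenever q == 0.
    """
    n = len(records)
    total = sum(1.0 / p for (x, y, q, p) in records if q == 1 and h(x) != y)
    return total / n


def expected_iw_error(h, data, probs):
    """Exact expectation of iw_error over independent coins Q_i ~ Bernoulli(p_i)."""
    expectation = 0.0
    for qs in product([0, 1], repeat=len(data)):
        weight = 1.0
        for q, p in zip(qs, probs):
            weight *= p if q == 1 else (1.0 - p)
        records = [(x, y, q, p) for (x, y), q, p in zip(data, qs, probs)]
        expectation += weight * iw_error(h, records)
    return expectation


if __name__ == "__main__":
    data = [(0.2, -1), (0.6, 1), (0.7, -1), (0.9, 1)]   # toy (x, y) pairs
    probs = [1.0, 0.5, 0.25, 0.1]                        # fixed query probabilities
    h = lambda x: 1 if x > 0.5 else -1
    plain = sum(1 for x, y in data if h(x) != y) / len(data)
    print(plain, expected_iw_error(h, data, probs))
```

Small query probabilities leave the mean unchanged but inflate the variance, which is the issue the deviation bound in the next section addresses.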



3 A Deviation Bound for Importance Weighted Estimators 

As mentioned before, the rejection threshold used by our algorithm is based on importance weighted error
estimates $\mathrm{err}(h, Z_{1:n})$. Even though these estimates are unbiased, they are only reliable when the variance
is not too large. To get a handle on this, we need a deviation bound for importance weighted estimators.
This is complicated by two factors that rule out straightforward applications of some standard bounds:

1. The importance weighted samples $(X_i, Y_i, 1/P_i)$ (or equivalently, the $Z_i = (X_i, Y_i, Q_i)$) are not i.i.d.
This is because the query probability $P_i$ (and thus the importance weight $1/P_i$) generally depends on
$Z_{1:i-1}$ and $X_i$.

2. The effective range and variance of each term in the estimator are, themselves, random variables.

To address these issues, we develop a deviation bound using a martingale technique from [Zha05].

Let $f : \mathcal{X} \times \mathcal{Y} \to [-1, 1]$ be a bounded function. Consider any rejection threshold function
$p : (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^* \times \mathcal{X} \to (0, 1]$ for which $P_n = p(Z_{1:n-1}, X_n)$ is bounded below by
some positive quantity (which may depend on $n$). Equivalently, the query probabilities $P_n$ should have
inverses $1/P_n$ bounded above by some deterministic quantity $r_{\max}$ (which, again, may depend on $n$). The
a priori upper bound $r_{\max}$ on $1/P_n$ can be pessimistic, as the dependence on $r_{\max}$ in the final deviation
bound will be very mild: it enters in as $\log \log r_{\max}$. Our goal is to prove a bound on
$|\hat{f}(Z_{1:n}) - \mathbb{E}[f(X, Y)]|$ that holds with high probability over the joint distribution of $Z_{1:n}$.

To start, we establish bounds on the range and variance of each term $W_i := (Q_i/P_i) \cdot f(X_i, Y_i)$ in
the estimator, conditioned on $(X_{1:i}, Y_{1:i}, Q_{1:i-1})$. Let $\mathbb{E}_i[\,\cdot\,]$ denote
$\mathbb{E}[\,\cdot \mid X_{1:i}, Y_{1:i}, Q_{1:i-1}]$. Note that $\mathbb{E}_i[W_i] = (\mathbb{E}_i[Q_i]/P_i) \cdot f(X_i, Y_i) = f(X_i, Y_i)$,
so if $\mathbb{E}_i[W_i] = 0$, then $W_i = 0$. Therefore, the (conditional) range and variance are non-zero only if
$\mathbb{E}_i[W_i] \ne 0$. For the range, we have $|W_i| = (Q_i/P_i) \cdot |f(X_i, Y_i)| \le 1/P_i$, and for the variance,
$\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \le (\mathbb{E}_i[Q_i^2]/P_i^2) \cdot f(X_i, Y_i)^2 \le 1/P_i$. These range and variance bounds
indicate the form of the deviations we can expect, similar to that of other classical deviation bounds.

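Theorem 1. Pick any $t \ge 0$ and $n \ge 1$. Assume $1 \le 1/P_i \le r_{\max}$ for all $1 \le i \le n$, and let
$R_n := 1/\min(\{P_i : 1 \le i \le n \wedge f(X_i, Y_i) \ne 0\} \cup \{1\})$. With probability at least
$1 - 2(3 + \log_2 r_{\max}) e^{-t/2}$,

$$\left| \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i}{P_i} \cdot f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \le \sqrt{\frac{2 R_n t}{n}} + \sqrt{\frac{2t}{n}} + \frac{R_n t}{3n}.$$
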

We defer all proofs to the appendices. 
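
As a numerical illustration of how mildly the bound depends on $r_{\max}$, the sketch below evaluates a deviation bound of the form $\sqrt{2 R_n t/n} + \sqrt{2t/n} + R_n t/(3n)$ after solving $\delta = 2(3 + \log_2 r_{\max}) e^{-t/2}$ for $t$. That form is an assumption here: it is what combining Lemma 8 with Hoeffding's inequality (Appendix A) appears to give. The function name and the numbers in the example are hypothetical.

```python
import math


def deviation_bound(n, r_n, r_max, delta):
    """Evaluate the (assumed) Theorem 1 bound at confidence level 1 - delta.

    n     : number of unlabeled points
    r_n   : largest importance weight 1/P_i among the relevant terms
    r_max : a priori upper bound on every 1/P_i (r_max >= r_n >= 1)
    """
    if not (1.0 <= r_n <= r_max) or not (0.0 < delta < 1.0) or n < 1:
        raise ValueError("need 1 <= r_n <= r_max, 0 < delta < 1, n >= 1")
    # Solve delta = 2 * (3 + log2(r_max)) * exp(-t / 2) for t.
    t = 2.0 * math.log(2.0 * (3.0 + math.log2(r_max)) / delta)
    return (math.sqrt(2.0 * r_n * t / n)
            + math.sqrt(2.0 * t / n)
            + r_n * t / (3.0 * n))


if __name__ == "__main__":
    for r_max in (10.0, 1e3, 1e6):
        print(r_max, deviation_bound(n=10_000, r_n=4.0, r_max=r_max, delta=0.05))
```

Raising `r_max` by five orders of magnitude changes the value only slightly, whereas the realized weight `r_n` enters under a square root and linearly.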
4 Algorithm

First, we state a deviation bound for the importance weighted error of hypotheses in a finite hypothesis class
$\mathcal{H}$ that holds for all $n \ge 1$. It is a simple consequence of Theorem 1 and union bounds; the form of the
bound motivates certain algorithmic choices to be described below.

Lemma 1. Pick any $\delta \in (0, 1)$. For all $n \ge 1$, let

$$\varepsilon_n := \frac{16 \log(2(3 + n \log_2 n)\, n(n+1)\, |\mathcal{H}|/\delta)}{n} = O\!\left(\frac{\log(n|\mathcal{H}|/\delta)}{n}\right). \qquad (3)$$

Let $(Z_1, Z_2, \ldots) \in (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^*$ be the sequence of random variables specified in
Section 2.2 using a rejection threshold $p : (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^* \times \mathcal{X} \to [0, 1]$ that satisfies
$p(z_{1:n}, x) \ge 1/n^n$ for all $(z_{1:n}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^n \times \mathcal{X}$ and all $n \ge 1$.
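Algorithm 1

Notes: see Eq. (1) for the definition of err (importance weighted error), and Section 4 for the definitions
of $C_0$, $c_1$, and $c_2$.
Initialize: $S_0 := \emptyset$.
For $k = 1, 2, \ldots, n$:

1. Obtain unlabeled data point $X_k$.

2. Let

$h_k := \arg\min\{\mathrm{err}(h, S_{k-1}) : h \in \mathcal{H}\}$, and

$h'_k := \arg\min\{\mathrm{err}(h, S_{k-1}) : h \in \mathcal{H} \wedge h(X_k) \ne h_k(X_k)\}$.

Let $G_k := \mathrm{err}(h'_k, S_{k-1}) - \mathrm{err}(h_k, S_{k-1})$, and

$$P_k := \begin{cases} 1 & \text{if } G_k \le \sqrt{\frac{C_0 \log k}{k-1}} + \frac{C_0 \log k}{k-1} \\ s & \text{otherwise} \end{cases}
\quad \left( = \min\left\{1,\ O\!\left(\frac{1}{G_k^2} + \frac{1}{G_k}\right) \cdot \frac{C_0 \log k}{k-1}\right\} \right)$$

where $s \in (0, 1)$ is the positive solution to the equation

$$G_k = \left(\frac{c_1}{\sqrt{s}} - c_1 + 1\right) \cdot \sqrt{\frac{C_0 \log k}{k-1}} + \left(\frac{c_2}{s} - c_2 + 1\right) \cdot \frac{C_0 \log k}{k-1}. \qquad (2)$$

3. Toss a biased coin with Pr(heads) $= P_k$.

If heads, then query $Y_k$, and let $S_k := S_{k-1} \cup \{(X_k, Y_k, 1/P_k)\}$.
Else, let $S_k := S_{k-1}$.
Return: $h_{n+1} := \arg\min\{\mathrm{err}(h, S_n) : h \in \mathcal{H}\}$.

Figure 1: Algorithm for importance weighted active learning with an error minimization oracle.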
The following holds with probability at least $1 - \delta$. For all $n \ge 1$ and all $h \in \mathcal{H}$,

$$\left| (\mathrm{err}(h, Z_{1:n}) - \mathrm{err}(h^*, Z_{1:n})) - (\mathrm{err}(h) - \mathrm{err}(h^*)) \right| \le \sqrt{\frac{\varepsilon_n}{P_{\min,n}(h)}} + \frac{\varepsilon_n}{P_{\min,n}(h)} \qquad (4)$$

where $P_{\min,n}(h) = \min(\{P_i : 1 \le i \le n \wedge h(X_i) \ne h^*(X_i)\} \cup \{1\})$.

We let $C_0 = O(\log(|\mathcal{H}|/\delta)) \ge 2$ be a quantity such that $\varepsilon_n$ (as defined in Eq. (3)) is bounded as
$\varepsilon_n \le C_0 \cdot \log(n+1)/n$. The following absolute constants are used in the description of the rejection threshold
and the subsequent analysis: $c_1 := 5 + 2\sqrt{2}$, $c_2 := 5$, $c_3 := ((c_1 + \sqrt{2})/(c_1 - 2))^2$, $c_4 := (c_1 + \sqrt{c_3})^2$,
$c_5 := c_2 + c_3$.

Our proposed algorithm is shown in Figure 1. The rejection threshold (Step 2) is based on the deviation
bound from Lemma 1. First, the importance weighted error minimizing hypothesis $h_k$ and the "alternative"
hypothesis $h'_k$ are found. Note that both optimizations are over the entire hypothesis class $\mathcal{H}$ (with $h'_k$ only
being required to disagree with $h_k$ on $X_k$); this is a key aspect where our algorithm differs from previous
approaches. The difference in importance weighted errors $G_k$ of the two hypotheses is then computed. If
$G_k \le \sqrt{(C_0 \log k)/(k-1)} + (C_0 \log k)/(k-1)$, then the query probability $P_k$ is set to 1. Otherwise, $P_k$ is
set to the positive solution $s$ to the quadratic equation in Eq. (2). The functional form of $P_k$ is roughly

$$\min\left\{1,\ O\!\left(\frac{1}{G_k^2} + \frac{1}{G_k}\right) \cdot \frac{C_0 \log k}{k-1}\right\}.$$

It can be checked that $P_k \in (0, 1]$ and that $P_k$ is non-increasing with $G_k$. It is also useful to note that
$(\log k)/(k-1)$ is monotonically decreasing with $k \ge 1$ (we use the convention $\log(1)/0 = \infty$).
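
The following is a minimal, brute-force sketch of the procedure just described, for a small finite hypothesis class. It is not the authors' implementation: the ERM oracle is replaced by exhaustive search, the names `query_probability`, `iw_error`, `algorithm1`, and `make_threshold` are hypothetical, and the closed form for $s$ comes from treating the equation $G_k = (c_1/\sqrt{s} - c_1 + 1)\sqrt{\varepsilon} + (c_2/s - c_2 + 1)\varepsilon$, with $\varepsilon = C_0 \log k/(k-1)$, as a quadratic in $1/\sqrt{s}$ (the same solution that appears in Appendix B.1).

```python
import math
import random

C1 = 5.0 + 2.0 * math.sqrt(2.0)  # c_1 from Section 4
C2 = 5.0                          # c_2 from Section 4


def query_probability(gap, k, c0):
    """Query probability P_k for the empirical error gap G_k at round k."""
    if k == 1:
        return 1.0  # convention log(1)/0 = infinity: first label always queried
    eps = c0 * math.log(k) / (k - 1)
    if gap <= math.sqrt(eps) + eps:
        return 1.0
    # Quadratic in u = 1/sqrt(s): c2*eps*u^2 + c1*sqrt(eps)*u - a = 0.
    a = gap + (C1 - 1.0) * math.sqrt(eps) + (C2 - 1.0) * eps
    root_s = (C1 * math.sqrt(eps)
              + math.sqrt(C1 * C1 * eps + 4.0 * a * C2 * eps)) / (2.0 * a)
    return min(1.0, root_s * root_s)  # clamp guards against rounding error


def iw_error(h, sample, n_seen):
    """Importance weighted error of h on sample [(x, y, 1/p), ...]."""
    return sum(w for (x, y, w) in sample if h(x) != y) / max(n_seen, 1)


def algorithm1(xs, label_oracle, hypotheses, c0, rng):
    """Return (index of the final hypothesis, importance weighted sample)."""
    sample = []
    for k, x in enumerate(xs, start=1):
        errs = [iw_error(h, sample, k - 1) for h in hypotheses]
        best = min(range(len(hypotheses)), key=errs.__getitem__)   # h_k
        pred = hypotheses[best](x)
        # h'_k: best hypothesis disagreeing with h_k on x. Assumes one exists,
        # as in the paper's assumption that H never fully agrees on a point.
        alt_err = min(e for h, e in zip(hypotheses, errs) if h(x) != pred)
        p = query_probability(alt_err - errs[best], k, c0)
        if rng.random() < p:
            sample.append((x, label_oracle(x), 1.0 / p))
    n = len(xs)
    final = min(range(len(hypotheses)),
                key=lambda j: iw_error(hypotheses[j], sample, n))
    return final, sample


def make_threshold(t):
    """Threshold classifier on the real line: +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1


if __name__ == "__main__":
    rng = random.Random(0)
    thresholds = [j / 20 for j in range(21)]
    hyps = [make_threshold(t) for t in thresholds]
    xs = [rng.random() for _ in range(300)]
    final, sample = algorithm1(xs, make_threshold(0.5), hyps, c0=2.0, rng=rng)
    print("chosen threshold:", thresholds[final], "labels queried:", len(sample))
```

Note that both minimizations range over the whole class and that unqueried points leave no trace in `sample` except through the normalization by the number of points seen.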

In order to apply Lemma 1 with our rejection threshold, we need to establish the (very crude) bound
$P_k \ge 1/k^k$ for all $k$.

Lemma 2. The rejection threshold of Algorithm 1 satisfies $p(z_{1:n-1}, x) \ge 1/n^n$ for all $n \ge 1$ and all
$(z_{1:n-1}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0, 1\})^{n-1} \times \mathcal{X}$.

Note that this is a worst-case bound; our analysis shows that the probabilities $P_k$ are more like $1/\mathrm{poly}(k)$
in the typical case.



5 Analysis 

5.1 Correctness 

We first prove a consistency guarantee for Algorithm 1 that bounds the generalization error of the importance
weighted empirical error minimizer. The proof actually establishes a lower bound on the query probabilities
$P_i \ge 1/2$ for $X_i$ such that $h_n(X_i) \ne h^*(X_i)$. This offers an intuitive characterization of the weighting
landscape induced by the importance weights $1/P_i$.

Theorem 2. The following holds with probability at least $1 - \delta$. For any $n \ge 1$,

$$0 \le \mathrm{err}(h_n) - \mathrm{err}(h^*) \le \mathrm{err}(h_n, Z_{1:n-1}) - \mathrm{err}(h^*, Z_{1:n-1}) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$

This implies, for all $n \ge 1$,

$$\mathrm{err}(h_n) \le \mathrm{err}(h^*) + \sqrt{\frac{2 C_0 \log n}{n-1}} + \frac{2 C_0 \log n}{n-1}.$$


Therefore, the final hypothesis returned by Algorithm 1 after seeing n unlabeled data has roughly the 
same error bound as a hypothesis returned by a standard passive learner with n labeled data. A variant of 
this result under certain noise conditions is given in the appendix. 
5.2 Label Complexity Analysis

We now bound the number of labels requested by Algorithm 1 after n iterations. The following lemma 
bounds the probability of querying the label Y n ; this is subsequently used to establish the final bound on 
the expected number of labels queried. The key to the proof is in relating empirical error differences and 
their deviations to the probability of querying a label. This is mediated through the disagreement coefficient, 
a quantity first used by [Han07] for analyzing the label complexity of the $A^2$ algorithm of [BBL06]. The
disagreement coefficient $\theta := \theta(h^*, \mathcal{H}, \mathcal{D})$ is defined as

$$\theta(h^*, \mathcal{H}, \mathcal{D}) := \sup\left\{ \frac{\Pr(X \in \mathrm{DIS}(h^*, r))}{r} : r > 0 \right\}$$

where

$$\mathrm{DIS}(h^*, r) := \{x \in \mathcal{X} : \exists h' \in \mathcal{H} \text{ such that } \Pr(h^*(X) \ne h'(X)) \le r \text{ and } h^*(x) \ne h'(x)\}$$

(the disagreement region around $h^*$ at radius $r$). This quantity is bounded for many learning problems
studied in the literature; see [Han07, Han09, Fri09, Wan09] for more discussion. Note that the supremum
can instead be taken over $r > \epsilon$ if the target excess error is $\epsilon$, which allows for a more detailed analysis.
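
As a small worked illustration (not from the paper), the sketch below estimates the disagreement coefficient for threshold classifiers on a grid of points standing in for the uniform distribution on [0, 1]. All names are hypothetical, probabilities are replaced by empirical frequencies on the grid, and the supremum over $r > 0$ is approximated by a maximum over a finite list of radii. For thresholds, the disagreement region at radius $r$ is an interval of mass about $2r$ around the optimal threshold, so the ratio is 2 at small radii.

```python
def disagreement_coefficient(hypotheses, h_star, xs, radii):
    """Empirical disagreement coefficient of h_star within a finite class.

    xs    : sample of points standing in for the marginal distribution
    radii : positive radii r over which the ratio Pr(DIS(h*, r)) / r is maximized
    """
    n = len(xs)
    star = [h_star(x) for x in xs]
    preds = [[h(x) for x in xs] for h in hypotheses]
    # Empirical disagreement mass between each hypothesis and h_star.
    dists = [sum(a != b for a, b in zip(p, star)) / n for p in preds]
    theta = 0.0
    for r in radii:
        ball = [p for p, d in zip(preds, dists) if d <= r]
        # Points on which some hypothesis in the ball disagrees with h_star.
        region = sum(1 for i in range(n) if any(p[i] != star[i] for p in ball))
        theta = max(theta, (region / n) / r)
    return theta


def make_threshold(t):
    """Threshold classifier on the real line: +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1


if __name__ == "__main__":
    xs = [(i + 0.5) / 100 for i in range(100)]
    hyps = [make_threshold(j / 100) for j in range(101)]
    print(disagreement_coefficient(hyps, make_threshold(0.5), xs, [0.05, 0.1, 0.2, 0.6]))
```

At the largest radius the whole domain is in the disagreement region, so the ratio falls below 2; this is the sense in which restricting the supremum to larger radii can only help.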

Lemma 3. Assume the bounds from Eq. (4) hold for all $h \in \mathcal{H}$ and $n \ge 1$. For any $n \ge 1$,

$$\mathbb{E}[Q_n] \le \theta \cdot 2\,\mathrm{err}(h^*) + O\!\left(\theta \cdot \sqrt{\frac{C_0 \log n}{n-1}} + \theta \cdot \frac{C_0 \log^2 n}{n-1}\right).$$

Theorem 3. With probability at least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$
iterations is at most

$$1 + \theta \cdot 2\,\mathrm{err}(h^*) \cdot (n - 1) + O\!\left(\theta \cdot \sqrt{C_0\, n \log n} + \theta \cdot C_0 \log^3 n\right).$$

Proof. Follows from assuming $Y_1$ is always queried; applying Lemmas 1, 2, and 3; and linearity of expectation. □

The bound is dominated by a linear term scaled by err(h*), plus a sublinear term. The linear term 
$\mathrm{err}(h^*) \cdot n$ is unavoidable in the worst case, as evident from label complexity lower bounds [Kaa06, BDL09].
When $\mathrm{err}(h^*)$ is negligible (e.g., the data is separable) and $\theta$ is bounded (as is the case for many problems
studied in the literature [Han07]), then the bound represents a polynomial label complexity improvement
over supervised learning, similar to that achieved by the version space algorithm from [BDL09].

5.3 Analysis under Low Noise Conditions 

Some recent work on active learning has focused on improved label complexity under certain noise conditions
[CN06, BBZ07, CN07, Han09, Kol09]. Specifically, it is assumed that there exist constants $\kappa > 0$ and
$0 < \alpha \le 1$ such that

$$\Pr(h(X) \ne h^*(X)) \le \kappa \cdot (\mathrm{err}(h) - \mathrm{err}(h^*))^{\alpha} \qquad (5)$$

for all $h \in \mathcal{H}$. This is related to Tsybakov's low noise condition [Tsy04]. Essentially, this condition requires
that low error hypotheses not be too far from the optimal hypothesis $h^*$ under the disagreement metric
$\Pr(h^*(X) \ne h(X))$. Under this condition, Lemma 3 can be improved, which in turn yields the following
theorem.

Theorem 4. Assume that for some value of $\kappa > 0$ and $0 < \alpha \le 1$, the condition in Eq. (5) holds for all
$h \in \mathcal{H}$. There is a constant $c_\alpha > 0$ depending only on $\alpha$ such that the following holds. With probability at
least $1 - \delta$, the expected number of labels queried by Algorithm 1 after $n$ iterations is at most

$$\theta \cdot \kappa \cdot c_\alpha \cdot (C_0 \log n)^{\alpha/2} \cdot n^{1 - \alpha/2}.$$

Note that the bound is sublinear in $n$ for all $0 < \alpha \le 1$, which implies label complexity improvements
whenever $\theta$ is bounded (an improved analogue of Theorem 2 under these conditions can be established
using similar techniques). The previous algorithms of [Han09, Kol09] obtain even better rates under these
noise conditions using specialized data dependent generalization bounds, but these algorithms also required
optimizations over restricted version spaces, even for the bound computation.

6 Experiments 

Although agnostic learning is typically intractable in the worst case, empirical risk minimization can serve 
as a useful abstraction for many practical supervised learning algorithms in non-worst case scenarios. With 
this in mind, we conducted a preliminary experimental evaluation of Algorithm 1, implemented using a 
popular algorithm for learning decision trees in place of the required ERM oracle. Specifically, we use the 
J48 algorithm from Weka v3.6.2 (with default parameters) to select the hypothesis $h_k$ in each round $k$; to
produce the "alternative" hypothesis $h'_k$, we just modify the decision tree $h_k$ by changing the label of the
node used for predicting on $X_k$. Both of these procedures are clearly heuristic, but they are similar in spirit
to the required optimizations. We set $C_0 = 8$ and $c_1 = c_2 = 1$; these can be regarded as tuning parameters,
with $C_0$ controlling the aggressiveness of the rejection threshold. We did not perform parameter tuning with
active learning although the importance weighting approach developed here could potentially be used for 
that. Rather, the goal of these experiments is to assess the compatibility of Algorithm 1 with an existing, 
practical supervised learning procedure. 

6.1 Data Sets 

We constructed two binary classification tasks using MNIST and KDDCUP99 data sets. For MNIST, we 
randomly chose 4000 training 3s and 5s for training (using the 3s as the positive class), and used all of the 
1902 testing 3s and 5s for testing. For KDDCUP99, we randomly chose 5000 examples for training, and 
another 5000 for testing. In both cases, we reduced the dimension of the data to 25 using PCA. 

To demonstrate the versatility of our algorithm, we also conducted a multi-class classification experiment 
using the entire MNIST data set (all ten digits, so 60000 training data and 10000 testing data). This required 
modifying how $h'_k$ is selected: we force $h'_k(x_k) \ne h_k(x_k)$ by changing the label of the prediction node for $x_k$
to the next best label. We used PCA to reduce the dimension to 40.

6.2 Results 

We examined the test error as a function of (i) the number of unlabeled data seen, and (ii) the number of 
labels queried. We compared the performance of the active learner described above to a passive learner (one 
that queries every label, so (i) and (ii) are the same) using J48 with default parameters. 

In all three cases, the test errors as a function of the number of unlabeled data were roughly the same for 
both the active and passive learners. This agrees with the consistency guarantee from Theorem 2. We note
that this is a basic property not satisfied by many active learning algorithms (this issue is discussed further
in [DH08]).

In terms of test error as a function of the number of labels queried (Figure 2), the active learner had
minimal improvement over the passive learner on the binary MNIST task, but a substantial improvement
over the passive learner on the KDDCUP99 task (even at small numbers of label queries). For the
multi-class MNIST task, the active learner had a moderate improvement over the passive learner. Note that
KDDCUP99 is far less noisy (more separable) than the MNIST 3s vs 5s task, so the results are in line with
the label complexity behavior suggested by Theorem 3, which states that the label complexity improvement
may scale with the error of the optimal hypothesis. Also, the results from the MNIST tasks suggest that the
active learner may require an initial random sampling phase during which it is equivalent to the passive
learner, and the advantage manifests itself after this phase. This again is consistent with the analysis (also
see [Han07]), as the disagreement coefficient can be large at initial scales, yet much smaller as the number
of (unlabeled) data increases and the scale becomes finer.

Figure 2: Test errors as a function of the number of labels queried.




7 Conclusion 

This paper provides a new active learning algorithm based on error minimization oracles, a departure from 
the version space approach adopted by previous works. The algorithm we introduce here motivates
computationally tractable and effective methods for active learning with many classifier training algorithms.
The overall algorithmic template applies to any training algorithm that (i) operates by approximate error 
minimization and (ii) for which the cost of switching a class prediction (as measured by example errors) can 
be estimated. Furthermore, although these properties might only hold in an approximate or heuristic sense, 
the created active learning algorithm will be "safe" in the sense that it will eventually converge to the same 
solution as a passive supervised learning algorithm. Consequently, we believe this approach can be widely 
used to reduce the cost of labeling in situations where labeling is expensive. 

Recent theoretical work on active learning has focused on improving rates of convergence. However, in 
some applications, it may be desirable to improve performance at much smaller sample sizes, perhaps even at 
the cost of improved rates as long as consistency is ensured. Importance sampling and weighting techniques 
like those analyzed in this work may be useful for developing more aggressive strategies with such properties. 
A Proof of Deviation Bound for Importance Weighted Estimators

The techniques here are mostly developed in [Zha05]; for completeness, we detail the proofs for our particular
application. The first two lemmas establish a basic bound in terms of conditional moment generating 
functions. 

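Lemma 4. For all $n \ge 1$ and all functionals $\Xi_i := \xi_i(Z_{1:i})$,

$$\mathbb{E}\left[\exp\left(\sum_{i=1}^{n} \Xi_i - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\Xi_i)]\right)\right] = 1.$$

Proof. A straightforward induction on $n$. □

Lemma 5. For all $t \ge 0$, $\lambda \in \mathbb{R}$, $n \ge 1$, and functionals $\Xi_i := \xi_i(Z_{1:i})$,

$$\Pr\left(\lambda \sum_{i=1}^{n} \Xi_i - \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda \Xi_i)] \ge t\right) \le e^{-t}.$$

Proof. The claim follows by Markov's inequality and Lemma 4 (replacing $\Xi_i$ with $\lambda \Xi_i$). □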

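In order to specialize Lemma 5 for our purposes, we first analyze the conditional moment generating
function of $W_i - \mathbb{E}_i[W_i]$.

Lemma 6. If $0 < \lambda < 3P_i$, then

$$\ln \mathbb{E}_i[\exp(\lambda(W_i - \mathbb{E}_i[W_i]))] \le \frac{1}{P_i} \cdot \frac{\lambda^2}{2(1 - \lambda/(3P_i))}.$$

If $\mathbb{E}_i[W_i] = 0$, then

$$\ln \mathbb{E}_i[\exp(\lambda(W_i - \mathbb{E}_i[W_i]))] = 0.$$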

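Proof. Let $g(x) := (\exp(x) - x - 1)/x^2$ for $x \ne 0$, so $\exp(x) = 1 + x + x^2 \cdot g(x)$. Note that $g(x)$ is
non-decreasing. Thus,

$$\begin{aligned}
\mathbb{E}_i[\exp(\lambda(W_i - \mathbb{E}_i[W_i]))]
&= \mathbb{E}_i\big[1 + \lambda(W_i - \mathbb{E}_i[W_i]) + \lambda^2 (W_i - \mathbb{E}_i[W_i])^2 \cdot g(\lambda(W_i - \mathbb{E}_i[W_i]))\big] \\
&= 1 + \lambda^2 \cdot \mathbb{E}_i\big[(W_i - \mathbb{E}_i[W_i])^2 \cdot g(\lambda(W_i - \mathbb{E}_i[W_i]))\big] \\
&\le 1 + \lambda^2 \cdot \mathbb{E}_i\big[(W_i - \mathbb{E}_i[W_i])^2 \cdot g(\lambda/P_i)\big] \\
&= 1 + \lambda^2 \cdot \mathbb{E}_i\big[(W_i - \mathbb{E}_i[W_i])^2\big] \cdot g(\lambda/P_i) \\
&\le 1 + (\lambda^2/P_i) \cdot g(\lambda/P_i)
\end{aligned}$$

where the first inequality follows from the range bound $|W_i| \le 1/P_i$ and the second follows from the variance
bound $\mathbb{E}_i[(W_i - \mathbb{E}_i[W_i])^2] \le 1/P_i$. Now the first claim follows from the definition of $g(x)$, the facts
$\exp(x) - x - 1 \le x^2/(2(1 - x/3))$ for $0 \le x < 3$ and $\ln(1 + x) \le x$.

The second claim is immediate from the definition of $W_i$ and the fact $\mathbb{E}_i[W_i] = f(X_i, Y_i)$. □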
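
As a sanity check on Lemma 6 (a numerical illustration only, not part of the proof), the sketch below evaluates both sides of the first claim in the special case $f(X_i, Y_i) = 1$, where $W_i = Q_i/P_i$ with $Q_i$ a Bernoulli($P_i$) coin and the log moment generating function has a closed form. The function names and the grid of test values are assumptions made for this example.

```python
import math


def log_mgf(p, lam):
    """ln E[exp(lam * (W - E[W]))] for W = Q/p, Q ~ Bernoulli(p), so E[W] = 1."""
    heads = p * math.exp(lam * (1.0 / p - 1.0))   # Q = 1: W - E[W] = 1/p - 1
    tails = (1.0 - p) * math.exp(-lam)            # Q = 0: W - E[W] = -1
    return math.log(heads + tails)


def lemma6_bound(p, lam):
    """Right-hand side of the first claim of Lemma 6; needs 0 < lam < 3p."""
    if not 0.0 < lam < 3.0 * p:
        raise ValueError("need 0 < lam < 3p")
    return (1.0 / p) * lam * lam / (2.0 * (1.0 - lam / (3.0 * p)))


def worst_slack():
    """Largest value of log_mgf - bound over a grid; negative means the bound holds."""
    worst = -math.inf
    for i in range(1, 21):
        p = i / 20
        for j in range(1, 10):
            lam = 3.0 * p * j / 10
            worst = max(worst, log_mgf(p, lam) - lemma6_bound(p, lam))
    return worst


if __name__ == "__main__":
    print("largest (log-MGF minus bound) on the grid:", worst_slack())
```

The $1/P_i$ factor in the bound reflects the conditional variance, which is why rarely queried points dominate the deviation.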

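We now combine Lemma 6 and Lemma 5 to bound the deviation of the importance weighted estimator
$\hat{f}(Z_{1:n})$ from $(1/n) \sum_{i=1}^{n} \mathbb{E}_i[W_i]$.

Lemma 7. Pick any $t \ge 0$, $n \ge 1$, and $p_{\min} > 0$, and let $E$ be the (joint) event

$$\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \ge \sqrt{\frac{1}{p_{\min}} \cdot \frac{2t}{n}} + \frac{1}{p_{\min}} \cdot \frac{t}{3n}$$

and $\min\{P_i : 1 \le i \le n \wedge \mathbb{E}_i[W_i] \ne 0\} \ge p_{\min}$.

Then $\Pr(E) \le e^{-t}$.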
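Proof. With foresight, let

$$\lambda := 3 p_{\min} \cdot \frac{\sqrt{\frac{1}{3 p_{\min}} \cdot \frac{2t}{3n}}}{1 + \sqrt{\frac{1}{3 p_{\min}} \cdot \frac{2t}{3n}}}.$$

Note that $0 < \lambda < 3 p_{\min}$. By Lemma 6 and the choice of $\lambda$, we have that if
$\min\{P_i : 1 \le i \le n \wedge \mathbb{E}_i[W_i] \ne 0\} \ge p_{\min}$, then

$$\frac{1}{n\lambda} \cdot \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda(W_i - \mathbb{E}_i[W_i]))] \le \frac{1}{p_{\min}} \cdot \frac{\lambda}{2(1 - \lambda/(3 p_{\min}))} = \sqrt{\frac{1}{p_{\min}} \cdot \frac{t}{2n}} \qquad (6)$$

and

$$\frac{t}{n\lambda} = \sqrt{\frac{1}{p_{\min}} \cdot \frac{t}{2n}} + \frac{1}{p_{\min}} \cdot \frac{t}{3n}. \qquad (7)$$

Let $E'$ be the event that

$$\frac{1}{n} \cdot \sum_{i=1}^{n} (W_i - \mathbb{E}_i[W_i]) - \frac{1}{n\lambda} \cdot \sum_{i=1}^{n} \ln \mathbb{E}_i[\exp(\lambda(W_i - \mathbb{E}_i[W_i]))] \ge \frac{t}{n\lambda}$$

and let $E''$ be the event $\min\{P_i : 1 \le i \le n \wedge \mathbb{E}_i[W_i] \ne 0\} \ge p_{\min}$. Together, Eq. (6) and Eq. (7) imply
$E \subseteq E' \cap E''$. And of course, $E' \cap E'' \subseteq E'$, so $\Pr(E) \le \Pr(E' \cap E'') \le \Pr(E') \le e^{-t}$ by Lemma 5. □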

To do away with the joint event in Lemma 7, we use the standard trick of taking a union bound over a
geometric sequence of possible values for $p_{\min}$.

Lemma 8. Pick any $t \ge 0$ and $n \ge 1$. Assume $1 \le 1/P_i \le r_{\max}$ for all $1 \le i \le n$, and let
$R_n := 1/\min(\{P_i : 1 \le i \le n \wedge \mathbb{E}_i[W_i] \ne 0\} \cup \{1\})$. We have

$$\Pr\left(\left|\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i]\right| \ge \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n}\right) \le 2(2 + \log_2 r_{\max}) e^{-t/2}.$$


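Proof. The assumption on $P_i$ implies $1 \le R_n \le r_{\max}$. Let $r_j := 2^j$ for $-1 \le j \le m := \lceil \log_2 r_{\max} \rceil$. Then

$$\begin{aligned}
&\Pr\left(\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \ge \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n}\right) \\
&\quad = \sum_{j=0}^{m} \Pr\left(\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \ge \sqrt{\frac{2 R_n t}{n}} + \frac{R_n t}{3n} \ \wedge\ r_{j-1} < R_n \le r_j\right) \\
&\quad \le \sum_{j=0}^{m} \Pr\left(\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \ge \sqrt{\frac{2 r_{j-1} t}{n}} + \frac{r_{j-1} t}{3n} \ \wedge\ R_n \le r_j\right) \\
&\quad = \sum_{j=0}^{m} \Pr\left(\frac{1}{n} \sum_{i=1}^{n} W_i - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[W_i] \ge \sqrt{\frac{2 r_j (t/2)}{n}} + \frac{r_j (t/2)}{3n} \ \wedge\ R_n \le r_j\right) \\
&\quad \le (2 + \log_2 r_{\max}) e^{-t/2}
\end{aligned}$$

where the last inequality follows from Lemma 7. Replacing $W_i$ with $-W_i$ bounds the probability of deviations
in the other direction in exactly the same way. The claim then follows by the union bound. □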

Proof of Theorem 1. By Hoeffding's inequality and the fact $|f(X_i, Y_i)| \le 1$, we have

$$\Pr\left(\left|\frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i) - \mathbb{E}[f(X, Y)]\right| \ge \sqrt{\frac{2t}{n}}\right) \le 2 e^{-t/2}.$$

Since $\mathbb{E}_i[W_i] = f(X_i, Y_i)$, the claim follows by combining this and Lemma 8 with the triangle inequality and
the union bound. □
B Remaining Proofs

In this section, we use the notation $\varepsilon_k := C_0 \log(k+1)/k$.
B.1 Proof of Lemma 2

By induction on n. Trivial for n = 1 (since p(cmpty sequence, x) ~ 1 for all x £ X), so now fix any 
n > 2 and assume as the inductive hypothesis p n -i = p(zi-.n—2,%) > — for all (z± :n -2,x) E 

(X x y x {0, 1})™~ 2 x X. Fix any (zi : „_i,x) £ (X x y x {0, x X, and consider the error difference 
g n := err(/i^, 2;i : „_i) — err(h n , 2i :n _i) used to determine p n := p(zi : „_i, x). We only have to consider the 
case <?„ > y/E„-i + e n -\. By the inductive hypothesis and triangle inequality, we have g n < 2(n — l) n . 
Solving the quadratic in Eq. ^ implies 

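\[
\begin{aligned}
\sqrt{p_n}
&= \frac{c_1 \sqrt{\varepsilon_{n-1}} + \sqrt{c_1^2\, \varepsilon_{n-1} + 4 \left(g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_2 - 1)\,\varepsilon_{n-1}\right) c_2\, \varepsilon_{n-1}}}{2 \left(g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_2 - 1)\,\varepsilon_{n-1}\right)} \\
&> \frac{\sqrt{4 \left(g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_2 - 1)\,\varepsilon_{n-1}\right) c_2\, \varepsilon_{n-1}}}{2 \left(g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_2 - 1)\,\varepsilon_{n-1}\right)} && \text{(dropping terms)} \\
&= \sqrt{\frac{c_2\, \varepsilon_{n-1}}{g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_2 - 1)\,\varepsilon_{n-1}}} \\
&\ge \sqrt{\frac{c_2\, \varepsilon_{n-1}}{g_n + (c_1 - 1)\sqrt{\varepsilon_{n-1}} + (c_1 - 1)\,\varepsilon_{n-1}}} && \text{(since } c_2 \le c_1\text{)} \\
&> \sqrt{\frac{c_2\, \varepsilon_{n-1}}{c_1\, g_n}} && \text{(since } g_n > \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}\text{)} \\
&= \sqrt{\frac{c_2\, C_0 \log n}{c_1 (n-1)\, g_n}} \\
&\ge \sqrt{\frac{c_2\, C_0 \log n}{2 c_1 (n-1)(n-1)^{n-1}}} && \text{(inductive hypothesis)} \\
&\ge \sqrt{\frac{1}{e\,(n-1)^n}} && \text{(since } C_0 \ge 2,\ n \ge 2, \text{ and } (c_2\, C_0 \log 2)/(2 c_1) \ge 1/e\text{)} \\
&\ge \sqrt{\frac{1}{n^n}} && \text{(since } (n/(n-1))^n \ge e\text{)}
\end{aligned}
\]

as required. □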
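
The closed form for $\sqrt{p_n}$ can be checked numerically. The sketch below is illustrative only: the values $c_1 = 5 + 2\sqrt{2}$, $c_2 = 5$ and $C_0 = 2$ are assumptions (this section only uses $c_2 \le c_1$ and $C_0 \ge 2$), and returning $1$ when $g_n \le \sqrt{\varepsilon_{n-1}} + \varepsilon_{n-1}$ is likewise assumed from the case split in the proof. It evaluates the closed form at the worst case $g_n = 2(n-1)^{n-1}$ and compares the result with $1/n^n$.

```python
import math

C1 = 5.0 + 2.0 * math.sqrt(2.0)  # assumed value of c_1 (not stated in this section)
C2 = 5.0                          # assumed value of c_2 (not stated in this section)
C0 = 2.0                          # assumed value of C_0; the proof needs C_0 >= 2

def epsilon(k):
    """eps_k = C_0 * log(k + 1) / k."""
    return C0 * math.log(k + 1) / k

def quadratic_rhs(s, eps):
    """Right-hand side of the quadratic that defines p_n, evaluated at s."""
    return (C1 / math.sqrt(s) - C1 + 1.0) * math.sqrt(eps) + (C2 / s - C2 + 1.0) * eps

def rejection_threshold(g, n):
    """p_n for error difference g at round n >= 2, via the closed form above."""
    eps = epsilon(n - 1)
    if g <= math.sqrt(eps) + eps:
        return 1.0  # assumed: small error differences are always queried
    a = g + (C1 - 1.0) * math.sqrt(eps) + (C2 - 1.0) * eps
    root = math.sqrt(C1 ** 2 * eps + 4.0 * a * C2 * eps)
    return ((C1 * math.sqrt(eps) + root) / (2.0 * a)) ** 2
```

For example, `rejection_threshold(2.0 * 7 ** 7, 8)` can be compared against `1 / 8 ** 8` to see how much slack the lemma leaves at $n = 8$.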
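B.2 Proof of Theorem 2

We condition on the $1 - \delta$ probability event that the deviation bounds from Lemma 1 hold (also using Lemma 2). The proof now proceeds by induction on $n$. The claim is trivially true for $n = 1$. Now pick any $n \ge 2$ and assume as the (strong) inductive hypothesis that

\[
0 \le \operatorname{err}(h_k) - \operatorname{err}(h^*) \le \operatorname{err}(h_k, Z_{1:k-1}) - \operatorname{err}(h^*, Z_{1:k-1}) + \sqrt{2\varepsilon_{k-1}} + 2\varepsilon_{k-1} \tag{8}
\]

for all $1 \le k \le n - 1$. We need to show Eq. (8) holds for $k = n$.

Let $P_{\min} := \min(\{P_i : 1 \le i \le n-1 \wedge h_n(X_i) \ne h^*(X_i)\} \cup \{1\})$. If $P_{\min} \ge 1/2$, then Eq. (4) implies that Eq. (8) holds for $k = n$ as needed. So assume for sake of contradiction that $P_{\min} < 1/2$, and let $n_0 := \max\{i \le n-1 : P_i = P_{\min} \wedge h_n(X_i) \ne h^*(X_i)\}$. By definition of $P_{n_0}$, we have

\[
\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) = \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 + 1 \right) \varepsilon_{n_0-1} .
\]

Using this fact together with the inductive hypothesis, we have

\[
\begin{aligned}
&\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) \\
&\quad = \operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) + \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) \\
&\quad \ge \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 + 1 \right) \varepsilon_{n_0-1} - \sqrt{2\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} \\
&\quad = \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 - \sqrt{2} \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 - 1 \right) \varepsilon_{n_0-1} .
\end{aligned}
\tag{9}
\]

We use the assumption $P_{\min} < 1/2$ to lower bound the righthand side to get the inequality

\[
\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) > (c_1 - 1)(\sqrt{2} - 1) \sqrt{\varepsilon_{n_0-1}} + (c_2 - 1)\, \varepsilon_{n_0-1} > 0 ,
\]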

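which implies $\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) > \operatorname{err}(h^*, Z_{1:n_0-1})$. Since $h'_{n_0}$ minimizes $\operatorname{err}(h, Z_{1:n_0-1})$ among hypotheses $h \in \mathcal{H}$ that disagree with $h_{n_0}$ on $X_{n_0}$, it must be that $h^*$ agrees with $h_{n_0}$ on $X_{n_0}$. By transitivity and the definition of $n_0$, we conclude that $h_n(X_{n_0}) = h'_{n_0}(X_{n_0})$; so $\operatorname{err}(h_n, Z_{1:n_0-1}) \ge \operatorname{err}(h'_{n_0}, Z_{1:n_0-1})$. Then

\[
\begin{aligned}
&\operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) \\
&\quad \ge \operatorname{err}(h_n) - \operatorname{err}(h^*) - \sqrt{\frac{1}{P_{\min}}\, \varepsilon_{n-1}} - \frac{1}{P_{\min}}\, \varepsilon_{n-1} \\
&\quad \ge \operatorname{err}(h_n, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) - 2 \sqrt{\frac{1}{P_{\min}}\, \varepsilon_{n_0-1}} - 2 \cdot \frac{1}{P_{\min}}\, \varepsilon_{n_0-1} \\
&\quad \ge \left( \frac{c_1 - 2}{\sqrt{P_{\min}}} - c_1 + 1 - \sqrt{2} \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2 - 2}{P_{\min}} - c_2 - 1 \right) \varepsilon_{n_0-1} \\
&\quad > \left( (c_1 - 1)(\sqrt{2} - 1) - 2\sqrt{2} \right) \sqrt{\varepsilon_{n_0-1}} + (c_2 - 5)\, \varepsilon_{n_0-1}
\end{aligned}
\]

where Eq. (4) is used in the first two inequalities, Eq. (9) and the fact $\operatorname{err}(h_n, Z_{1:n_0-1}) \ge \operatorname{err}(h'_{n_0}, Z_{1:n_0-1})$ are used in the third inequality, and the fact $P_{\min} < 1/2$ is used in the last inequality. This final quantity is non-negative, so we have $\operatorname{err}(h_n, Z_{1:n-1}) > \operatorname{err}(h^*, Z_{1:n-1})$, which contradicts the minimality of $\operatorname{err}(h_n, Z_{1:n-1})$. □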
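
The non-negativity of the final quantity depends on the constants. As a hedged illustration, the sketch below assumes $c_1 = 5 + 2\sqrt{2}$ and $c_2 = 5$ (values not stated in this section). With these values both limiting coefficients are zero, and the coefficients obtained by subtracting $2\sqrt{\varepsilon_{n_0-1}/P_{\min}} + 2\varepsilon_{n_0-1}/P_{\min}$ from the righthand side of Eq. (9) stay strictly above them whenever $P_{\min} < 1/2$.

```python
import math

C1 = 5.0 + 2.0 * math.sqrt(2.0)  # assumed value of c_1
C2 = 5.0                          # assumed value of c_2

def coeffs_at(p_min):
    """Coefficients of sqrt(eps) and eps after subtracting the deviation terms from Eq. (9)."""
    sqrt_coeff = (C1 - 2.0) / math.sqrt(p_min) - C1 + 1.0 - math.sqrt(2.0)
    lin_coeff = (C2 - 2.0) / p_min - C2 - 1.0
    return sqrt_coeff, lin_coeff

def limiting_coeffs():
    """Coefficients in the last line of the chain (the limit p_min -> 1/2)."""
    return (C1 - 1.0) * (math.sqrt(2.0) - 1.0) - 2.0 * math.sqrt(2.0), C2 - 5.0
```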

B.3 Proof of Lemma 3

First, we establish a property of the query probabilities that relates error deviations (via $P_{\min}$) to empirical error differences (via $P_n$). Both quantities play essential roles in bounding the label complexity through the disagreement metric structure around $h^*$.

Lemma 9. Assume the bounds from Eq. (4) hold for all $h \in \mathcal{H}$ and $n \ge 1$. For any $n \ge 1$, we have $P_n \le c_3 \cdot P_{\min}$, where $P_{\min} := \min(\{P_i : 1 \le i \le n-1 \wedge h(X_i) \ne h^*(X_i)\} \cup \{1\})$ and

\[
h := \begin{cases} h_n & \text{if } h_n \text{ disagrees with } h^* \text{ on } X_n , \\ h'_n & \text{if } h'_n \text{ disagrees with } h^* \text{ on } X_n . \end{cases} \tag{10}
\]

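Proof. We can assume $P_{\min} < 1/c_3$, since otherwise the claim is trivial. Pick any $n_0 \le n - 1$ such that $h(X_{n_0}) \ne h^*(X_{n_0})$ and $P_{n_0} = P_{\min}$ (such an $n_0$ is guaranteed to exist given the above assumption). We now proceed as in the proof of Theorem 2. We first show a lower bound on $\operatorname{err}(h, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1})$. Note that

\[
\begin{aligned}
&\operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) \\
&\quad = \operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) + \operatorname{err}(h_{n_0}, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) \\
&\quad \ge \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 + 1 \right) \varepsilon_{n_0-1} - \sqrt{2\varepsilon_{n_0-1}} - 2\varepsilon_{n_0-1} \\
&\quad = \left( \frac{c_1}{\sqrt{P_{\min}}} - c_1 + 1 - \sqrt{2} \right) \sqrt{\varepsilon_{n_0-1}} + \left( \frac{c_2}{P_{\min}} - c_2 - 1 \right) \varepsilon_{n_0-1}
\end{aligned}
\tag{11}
\]

where the inequality follows from Theorem 2. The righthand side is positive, so $h^*$ must disagree with $h'_{n_0}$ on $X_{n_0}$. By transitivity (recalling that $h(X_{n_0}) \ne h^*(X_{n_0})$), $h$ must agree with $h'_{n_0}$ on $X_{n_0}$. Therefore $\operatorname{err}(h, Z_{1:n_0-1}) - \operatorname{err}(h'_{n_0}, Z_{1:n_0-1}) \ge 0$, so the inequality in Eq. (11) holds with $h$ in place of $h'_{n_0}$ on the lefthand side.

Now $\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1})$ is related to $\operatorname{err}(h, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1})$ through $\operatorname{err}(h) - \operatorname{err}(h^*)$ using the deviation bound from Eq. (4) (as well as the fact $\varepsilon_{n_0-1} \ge \varepsilon_{n-1}$):

\[
\begin{aligned}
&\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) \\
&\quad \ge \operatorname{err}(h, Z_{1:n_0-1}) - \operatorname{err}(h^*, Z_{1:n_0-1}) - 2 \sqrt{\frac{1}{P_{\min}}\, \varepsilon_{n_0-1}} - 2 \cdot \frac{1}{P_{\min}}\, \varepsilon_{n_0-1} \\
&\quad \ge \left( \frac{c_1 - 2}{\sqrt{P_{\min}}} - c_1 + 1 - \sqrt{2} \right) \sqrt{\varepsilon_{n-1}} + \left( \frac{c_2 - 2}{P_{\min}} - c_2 - 1 \right) \varepsilon_{n-1} > 0 .
\end{aligned}
\tag{12}
\]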

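If $h = h_n$, then $\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) = \operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) \le 0$ by the minimality of $\operatorname{err}(h_n, Z_{1:n-1})$; this contradicts Eq. (12). Therefore it must be that $h = h'_n$. In this case,

\[
\begin{aligned}
\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1})
&\le \operatorname{err}(h'_n, Z_{1:n-1}) - \operatorname{err}(h_n, Z_{1:n-1}) \\
&= \left( \frac{c_1}{\sqrt{P_n}} - c_1 + 1 \right) \sqrt{\varepsilon_{n-1}} + \left( \frac{c_2}{P_n} - c_2 + 1 \right) \varepsilon_{n-1}
\end{aligned}
\tag{13}
\]

where the inequality follows from the minimality of $\operatorname{err}(h_n, Z_{1:n-1})$, and the subsequent step follows from the definition of $P_n$. Combining the lower bound in Eq. (12) and the upper bound in Eq. (13) implies that

\[
\frac{c_1}{\sqrt{P_n}} \sqrt{\varepsilon_{n-1}} + \frac{c_2}{P_n}\, \varepsilon_{n-1} \ge \left( \frac{c_1 - 2}{\sqrt{P_{\min}}} - \sqrt{2} \right) \sqrt{\varepsilon_{n-1}} + \left( \frac{c_2 - 2}{P_{\min}} - 2 \right) \varepsilon_{n-1} .
\]

It is easily checked that this implies $P_n \le c_3 \cdot P_{\min}$. □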

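Proof of Lemma 3. Define $h$ as in Eq. (10). By Lemma 9 we have $\min(\{P_i : 1 \le i \le n-1 \wedge h(X_i) \ne h^*(X_i)\} \cup \{1\}) \ge P_n/c_3$. We first show that

\[
\begin{aligned}
\operatorname{err}(h) - \operatorname{err}(h^*)
&\le \operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) + \sqrt{\frac{c_3}{P_n}\, \varepsilon_{n-1}} + \frac{c_3}{P_n}\, \varepsilon_{n-1} \\
&\le \sqrt{\frac{c_4}{P_n}} \cdot \sqrt{\varepsilon_{n-1}} + \frac{c_5}{P_n}\, \varepsilon_{n-1} .
\end{aligned}
\tag{14}
\]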

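The first inequality follows from Eq. (4) and Lemma 9. For the second inequality, we consider two cases depending on $h$. If $h = h'_n$, then we bound $\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1})$ from above by $\operatorname{err}(h'_n, Z_{1:n-1}) - \operatorname{err}(h_n, Z_{1:n-1})$ (by definition of $h$ and minimality of $\operatorname{err}(h_n, Z_{1:n-1})$), and then simplify

\[
\begin{aligned}
&\operatorname{err}(h'_n, Z_{1:n-1}) - \operatorname{err}(h_n, Z_{1:n-1}) + \sqrt{\frac{c_3}{P_n}\, \varepsilon_{n-1}} + \frac{c_3}{P_n}\, \varepsilon_{n-1} \\
&\quad \le \left( \frac{c_1}{\sqrt{P_n}} - c_1 + 1 \right) \sqrt{\varepsilon_{n-1}} + \left( \frac{c_2}{P_n} - c_2 + 1 \right) \varepsilon_{n-1} + \sqrt{\frac{c_3}{P_n}} \cdot \sqrt{\varepsilon_{n-1}} + \frac{c_3}{P_n}\, \varepsilon_{n-1} \\
&\quad \le \sqrt{\frac{c_4}{P_n}} \cdot \sqrt{\varepsilon_{n-1}} + \frac{c_5}{P_n}\, \varepsilon_{n-1}
\end{aligned}
\]

using the definition of $P_n$ and the facts $c_1 \ge 1$ and $c_2 \ge 1$. If instead $h = h_n$, then we use the facts $\operatorname{err}(h, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) = \operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) \le 0$ and $c_3 \le \min\{c_4, c_5\}$.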

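If $\operatorname{err}(h) - \operatorname{err}(h^*) = \gamma > 0$, then solving the quadratic inequality in Eq. (14) for $P_n$ gives the bound

\[
P_n \le \min\left\{ 1,\ \frac{3}{2} \cdot \left( \frac{c_4}{\gamma^2} + \frac{c_5}{\gamma} \right) \cdot \varepsilon_{n-1} \right\} .
\]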
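
The step from the quadratic inequality to the $\tfrac{3}{2}$-bound can be checked numerically. In the sketch below, `exact_p_upper` solves $\gamma = \sqrt{c_4 \varepsilon / P} + c_5 \varepsilon / P$ for $P$ in closed form (treating it as a quadratic in $1/\sqrt{P}$), and `relaxed_p_upper` is the simplified bound stated above. The function names and the sample values of $c_4$, $c_5$ used when trying it out are illustrative assumptions.

```python
import math

def exact_p_upper(gamma, eps, c4, c5):
    """Largest P > 0 satisfying gamma <= sqrt(c4*eps/P) + c5*eps/P."""
    a = math.sqrt(c4 * eps + 4.0 * c5 * eps * gamma)
    b = math.sqrt(c4 * eps)
    return ((a + b) / (2.0 * gamma)) ** 2

def relaxed_p_upper(gamma, eps, c4, c5):
    """Simplified bound min{1, 1.5 * (c4/gamma^2 + c5/gamma) * eps}."""
    return min(1.0, 1.5 * (c4 / gamma ** 2 + c5 / gamma) * eps)
```

The relaxation loses at most a constant factor, which is convenient because the simplified form integrates in closed form.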

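If $\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma$, then by the triangle inequality we have

\[
\Pr(h^*(X) \ne h(X)) \le \operatorname{err}(h^*) + \operatorname{err}(h) \le 2\operatorname{err}(h^*) + \gamma
\]

which in turn implies $X_n \in \mathrm{DIS}(h^*, 2\operatorname{err}(h^*) + \gamma)$. Note that $\Pr(X_n \in \mathrm{DIS}(h^*, 2\operatorname{err}(h^*) + \gamma)) \le \theta \cdot (2\operatorname{err}(h^*) + \gamma)$ by definition of $\theta$, so $\Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma) \le \theta \cdot (2\operatorname{err}(h^*) + \gamma)$.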

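Let $f(\gamma) := \partial \Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma)/\partial\gamma$ be the probability density (mass) function of the error difference $\operatorname{err}(h) - \operatorname{err}(h^*)$; note that this error difference is a function of $(Z_{1:n-1}, X_n)$. We compute the expected value of $Q_n$ by conditioning on $\operatorname{err}(h) - \operatorname{err}(h^*)$ and integrating (an upper bound on) $\mathbb{E}[Q_n \mid \operatorname{err}(h) - \operatorname{err}(h^*) = \gamma]$ with respect to $f(\gamma)$.

Let $\gamma_0 > 0$ be the positive solution to $1.5\,(c_4/\gamma^2 + c_5/\gamma)\,\varepsilon_{n-1} = 1$. It can be checked that $\gamma_0 \ge \sqrt{1.5\, c_4\, \varepsilon_{n-1}}$.
We have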
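\[
\begin{aligned}
\mathbb{E}[Q_n]
&= \mathbb{E}\big[\mathbb{E}[Q_n \mid Z_{1:n-1}, X_n]\big] \qquad \text{(the outer expectation is over } (Z_{1:n-1}, X_n)\text{)} \\
&= \int_0^1 \left( \frac{\partial}{\partial\gamma} \Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma) \right) \cdot \mathbb{E}[Q_n \mid \operatorname{err}(h) - \operatorname{err}(h^*) = \gamma] \, d\gamma \\
&\le \int_0^1 \left( \frac{\partial}{\partial\gamma} \Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma) \right) \cdot \min\left\{ 1,\ \frac{3}{2} \left( \frac{c_4}{\gamma^2} + \frac{c_5}{\gamma} \right) \varepsilon_{n-1} \right\} d\gamma \\
&\le \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} \cdot \Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le 1) \\
&\qquad - \int_0^1 \left( \frac{\partial}{\partial\gamma} \min\left\{ 1,\ \frac{3}{2} \left( \frac{c_4}{\gamma^2} + \frac{c_5}{\gamma} \right) \varepsilon_{n-1} \right\} \right) \cdot \Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma) \, d\gamma \\
&\le \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2} \left( \frac{2c_4}{\gamma^3} + \frac{c_5}{\gamma^2} \right) \varepsilon_{n-1} \cdot \theta \cdot (2\operatorname{err}(h^*) + \gamma) \, d\gamma \\
&= \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} + \theta \cdot 2\operatorname{err}(h^*) \cdot \frac{3}{2} \left( c_4 \left( \frac{1}{\gamma_0^2} - 1 \right) + c_5 \left( \frac{1}{\gamma_0} - 1 \right) \right) \varepsilon_{n-1} \\
&\qquad + \theta \cdot \frac{3}{2} \left( 2c_4 \left( \frac{1}{\gamma_0} - 1 \right) + c_5 \ln\frac{1}{\gamma_0} \right) \varepsilon_{n-1} \\
&\le \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} + \theta \cdot 2\operatorname{err}(h^*) + \theta \cdot \sqrt{6\, c_4\, \varepsilon_{n-1}} + \theta \cdot \frac{3 c_5}{4} \cdot \varepsilon_{n-1} \cdot \ln\frac{1}{1.5\, c_4\, \varepsilon_{n-1}}
\end{aligned}
\]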

where the first inequality uses the bound on $\mathbb{E}[Q_n \mid \operatorname{err}(h) - \operatorname{err}(h^*) = \gamma]$; the second inequality uses integration-by-parts; the third inequality uses the fact that the integrand from the previous line is $0$ for $0 \le \gamma \le \gamma_0$, as well as the bound on $\Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma)$; and the fourth inequality uses the definition of $\gamma_0$. □
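
The last simplification can be checked numerically. The sketch below is an illustration with hypothetical inputs (`c4`, `c5`, `theta`, `err_star` are free parameters, not values from the paper). It computes $\gamma_0$ as the positive root of $\gamma^2 - 1.5\,c_5\varepsilon\gamma - 1.5\,c_4\varepsilon = 0$, evaluates the closed-form integral from the second-to-last step of the display, and compares it with the final simplified bound.

```python
import math

def gamma0(eps, c4, c5):
    """Positive root of 1.5 * (c4 / g**2 + c5 / g) * eps = 1."""
    b = 1.5 * c5 * eps
    return 0.5 * (b + math.sqrt(b * b + 6.0 * c4 * eps))

def bound_before_simplifying(eps, c4, c5, theta, err_star):
    """Closed-form value of the integral over [gamma0, 1] plus the boundary term."""
    g0 = gamma0(eps, c4, c5)
    head = 1.5 * (c4 + c5) * eps
    const_part = theta * 2.0 * err_star * 1.5 * (
        c4 * (1.0 / g0 ** 2 - 1.0) + c5 * (1.0 / g0 - 1.0)) * eps
    linear_part = theta * 1.5 * (
        2.0 * c4 * (1.0 / g0 - 1.0) + c5 * math.log(1.0 / g0)) * eps
    return head + const_part + linear_part

def bound_after_simplifying(eps, c4, c5, theta, err_star):
    """Final bound, using only gamma0 >= sqrt(1.5 * c4 * eps) and the definition of gamma0."""
    head = 1.5 * (c4 + c5) * eps
    return (head + theta * 2.0 * err_star + theta * math.sqrt(6.0 * c4 * eps)
            + theta * 0.75 * c5 * eps * math.log(1.0 / (1.5 * c4 * eps)))
```

Both functions are only meaningful when $\gamma_0 < 1$, that is, when $\varepsilon_{n-1}$ is small enough for the integration range to be non-empty.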

B.4 Proof of Theorem 4

The theorem is a simple consequence of the following analogue of Lemma 3.

Lemma 10. Assume that for some value of $\kappa > 0$ and $0 < \alpha \le 1$, the condition in Eq. (5) holds for all $h \in \mathcal{H}$. Assume the bounds from Eq. (4) hold for all $h \in \mathcal{H}$ and $n \ge 1$. There is a constant $c_\alpha > 0$ such that the following holds. For any $n \ge 1$,

\[
\mathbb{E}[Q_n] \le \theta \cdot \kappa \cdot c_\alpha \cdot \left( \frac{C_0 \log n}{n - 1} \right)^{\alpha/2} .
\]



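Proof. For the most part, the proof is the same as that of Lemma 3. The key difference is to use the noise condition in Eq. (5) to directly bound $\Pr(h(X) \ne h^*(X)) \le \kappa \cdot (\operatorname{err}(h) - \operatorname{err}(h^*))^\alpha$, which in turn implies the bound $\Pr(\operatorname{err}(h) - \operatorname{err}(h^*) \le \gamma) \le \theta \kappa \gamma^\alpha$. As before, let $\gamma_0 \ge \sqrt{1.5\, c_4\, \varepsilon_{n-1}}$ be the solution to $1.5\,(c_4/\gamma^2 + c_5/\gamma)\,\varepsilon_{n-1} = 1$. First consider the case $\alpha < 1$. Then, the expectation of $Q_n$ can be bounded as

\[
\begin{aligned}
\mathbb{E}[Q_n]
&\le \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} + \int_{\gamma_0}^1 \frac{3}{2} \left( \frac{2c_4}{\gamma^3} + \frac{c_5}{\gamma^2} \right) \varepsilon_{n-1} \cdot \theta \kappa \gamma^\alpha \, d\gamma \\
&\le \frac{3}{2}\,(c_4 + c_5)\, \varepsilon_{n-1} + \theta \cdot \kappa \cdot \frac{3}{2} \left( \frac{2c_4}{2 - \alpha} \cdot \frac{1}{\gamma_0^{2-\alpha}} + \frac{c_5}{1 - \alpha} \cdot \frac{1}{\gamma_0^{1-\alpha}} \right) \varepsilon_{n-1} .
\end{aligned}
\]

The case $\alpha = 1$ is handled similarly. □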
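
As a rough check of the case $\alpha < 1$, the sketch below compares a midpoint-rule evaluation of the integral over $[\gamma_0, 1]$ with the closed-form upper bound obtained by dropping the negative terms at the upper limit. All parameter values used when trying it out (`c4`, `c5`, `theta`, `kappa`) are hypothetical, and the quadrature is only a numerical illustration, not part of the proof.

```python
import math

def gamma0(eps, c4, c5):
    """Positive root of 1.5 * (c4 / g**2 + c5 / g) * eps = 1."""
    b = 1.5 * c5 * eps
    return 0.5 * (b + math.sqrt(b * b + 6.0 * c4 * eps))

def tail_integral(eps, c4, c5, theta, kappa, alpha, steps=20000):
    """Midpoint-rule value of the integral over [gamma0, 1] in the proof."""
    lo = gamma0(eps, c4, c5)
    width = (1.0 - lo) / steps
    total = 0.0
    for i in range(steps):
        g = lo + (i + 0.5) * width
        total += 1.5 * (2.0 * c4 / g ** 3 + c5 / g ** 2) * eps * theta * kappa * g ** alpha
    return total * width

def closed_form_bound(eps, c4, c5, theta, kappa, alpha):
    """Upper bound on the integral for alpha < 1 (terms at gamma = 1 dropped)."""
    g0 = gamma0(eps, c4, c5)
    return theta * kappa * 1.5 * (
        2.0 * c4 / (2.0 - alpha) / g0 ** (2.0 - alpha)
        + c5 / (1.0 - alpha) / g0 ** (1.0 - alpha)) * eps
```

Since $\gamma_0$ scales like $\sqrt{\varepsilon_{n-1}}$, the dominant term $\varepsilon_{n-1}/\gamma_0^{2-\alpha}$ scales like $\varepsilon_{n-1}^{\alpha/2}$.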

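B.5 Analogue of Theorem 2 under Low Noise Conditions

We first state a variant of Lemma 1 that takes into account the probability of disagreement between a hypothesis $h$ and the optimal hypothesis $h^*$.

Lemma 11. There exists an absolute constant $c > 0$ such that the following holds. Pick any $\delta \in (0, 1)$. For all $n \ge 1$, let

\[
\varepsilon_n := \frac{c \cdot \log((n+1)\,|\mathcal{H}|/\delta)}{n} .
\]

Let $(Z_1, Z_2, \ldots) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^*$ be the sequence of random variables specified in Section 2.2 using a rejection threshold $p : (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^* \times \mathcal{X} \to [0,1]$ that satisfies $p(z_{1:n}, x) \ge 1/n^n$ for all $(z_{1:n}, x) \in (\mathcal{X} \times \mathcal{Y} \times \{0,1\})^n \times \mathcal{X}$ and all $n \ge 1$.

The following holds with probability at least $1 - \delta$. For all $n \ge 1$ and all $h \in \mathcal{H}$,

\[
\left| \left( \operatorname{err}(h, Z_{1:n}) - \operatorname{err}(h^*, Z_{1:n}) \right) - \left( \operatorname{err}(h) - \operatorname{err}(h^*) \right) \right| \le \sqrt{\frac{\Pr(h(X) \ne h^*(X))}{P_{\min,n}(h)} \cdot \varepsilon_n} + \frac{\varepsilon_n}{P_{\min,n}(h)}
\]

where $P_{\min,n}(h) = \min(\{P_i : 1 \le i \le n \wedge h(X_i) \ne h^*(X_i)\} \cup \{1\})$.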

Proof sketch. The proof of this lemma follows along the same lines as that of Lemma 1. A key difference comes in Lemma 7: the joint event is modified to also conjoin with

1 " 

-yi(E,[/(I„y,)]<0) < a 

i=l 

for some fixed a > 0. In the proof, the parameter A should be chosen as 



2at 

v 3p mln 3n 

A := 3p„ 



'ill I II 



^~3j~ ' In 

Lemma 8 is modified to also take a union bound over a sequence of possible values for $a$ (in fact, only $n + 1$ different values need to be considered). Finally, instead of combining with Hoeffding's inequality, we use Bernstein's inequality (or a multiplicative form of Chernoff's bound) so the resulting bound (an analogue of Theorem 1) involves an empirical average inside the square-root term: with probability at least $1 - O(n \cdot \log_2 r_{\max})\, e^{-t/2}$,

\[
\left| \frac{1}{n} \sum_{i=1}^n \frac{Q_i}{P_i}\, f(X_i, Y_i) - \mathbb{E}[f(X, Y)] \right| \le O\left( \sqrt{\frac{R_n A_n t}{n}} + \frac{R_n t}{3n} \right)
\]

where

\[
A_n := \frac{1}{n} \sum_{i=1}^n \mathbb{1}(f(X_i, Y_i) \ne 0) .
\]

Finally, we apply this deviation bound to obtain uniform error bounds over all hypotheses $\mathcal{H}$ (a few extra steps are required to replace the empirical quantity $A_n$ in the bound with a distributional quantity). □

Using the previous lemma, a modified version of Theorem 2 follows from essentially the same proof. We note that the quantity $C_1 := O(\log(|\mathcal{H}|/\delta))$ used here may differ from $C_0$ by constant factors.

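Lemma 12. The following holds with probability at least $1 - \delta$. For any $n \ge 1$,

\[
0 \le \operatorname{err}(h_n) - \operatorname{err}(h^*) \le \operatorname{err}(h_n, Z_{1:n-1}) - \operatorname{err}(h^*, Z_{1:n-1}) + \sqrt{\frac{2 \Pr(h_n(X) \ne h^*(X))\, C_1 \log n}{n - 1}} + \frac{2 C_1 \log n}{n - 1} .
\]

This implies, for all $n \ge 1$,

\[
\operatorname{err}(h_n) \le \operatorname{err}(h^*) + \sqrt{\frac{2 \Pr(h_n(X) \ne h^*(X))\, C_1 \log n}{n - 1}} + \frac{2 C_1 \log n}{n - 1} .
\]

Finally, using the noise condition to bound $\Pr(h_n(X) \ne h^*(X)) \le \kappa \cdot (\operatorname{err}(h_n) - \operatorname{err}(h^*))^\alpha$, we obtain the final error bound.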

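Theorem 5. The following holds with probability at least $1 - \delta$. For any $n \ge 1$,

\[
\operatorname{err}(h_n) \le \operatorname{err}(h^*) + c_\kappa \cdot \left( \frac{C_1 \log n}{n - 1} \right)^{\frac{1}{2 - \alpha}}
\]

where $c_\kappa$ is a constant that depends only on $\kappa$.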
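
The passage from Lemma 12 to the rate in Theorem 5 amounts to solving a self-bounding inequality. As a minimal sketch (the function name and the sample parameters used when trying it out are illustrative), write $\Delta = \operatorname{err}(h_n) - \operatorname{err}(h^*)$ and $\varepsilon = C_1 \log n/(n-1)$, so that $\Delta \le \sqrt{2\kappa \Delta^\alpha \varepsilon} + 2\varepsilon$. The code finds the largest such $\Delta$ by fixed-point iteration; it can then be compared with $\max\{4\varepsilon, (8\kappa\varepsilon)^{1/(2-\alpha)}\}$, which follows from bounding the sum of two terms by twice the larger one.

```python
import math

def error_gap_bound(eps, kappa, alpha, iters=200):
    """Fixed point of D = sqrt(2*kappa*D**alpha*eps) + 2*eps, starting from D = 1."""
    delta = 1.0
    for _ in range(iters):
        delta = math.sqrt(2.0 * kappa * delta ** alpha * eps) + 2.0 * eps
    return delta

def rate_envelope(eps, kappa, alpha):
    """Upper envelope max{4*eps, (8*kappa*eps)**(1/(2-alpha))} for the fixed point."""
    return max(4.0 * eps, (8.0 * kappa * eps) ** (1.0 / (2.0 - alpha)))
```

The iteration converges because the map is increasing and concave in $\Delta$ with slope below $\alpha/2$ at its fixed point.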